ENH: add dropna argument to pivot_table #4106

Merged
merged 2 commits into from
Jul 10, 2013

Conversation

hayd
Contributor

@hayd hayd commented Jul 2, 2013

fixes #3820

a = np.array(['foo', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'], dtype=object)
b = np.array(['one', 'one', 'two', 'one', 'two', 'two', 'two'], dtype=object)
c = np.array(['dull', 'dull', 'dull', 'dull', 'dull', 'shiny', 'shiny'], dtype=object)

In [11]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'], drop_na=False)
Out[11]:
b     one          two
c    dull  shiny  dull  shiny
a
bar     1      0     1      0
foo     2      0     1      2

Also same argument for pivot_table.
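
For anyone reading this later, here is a small sketch of the behaviour using the final keyword spelling `dropna` (the spelling settled on later in this thread). It assumes a reasonably recent pandas; the variable names `with_empty` and `without_empty` are only illustrative.

```python
import numpy as np
import pandas as pd

a = np.array(['foo', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'], dtype=object)
b = np.array(['one', 'one', 'two', 'one', 'two', 'two', 'two'], dtype=object)
c = np.array(['dull', 'dull', 'dull', 'dull', 'dull', 'shiny', 'shiny'], dtype=object)

# dropna=False keeps every (b, c) combination, even ones that never occur.
with_empty = pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'],
                         dropna=False)

# dropna=True (the default) omits combinations with no observations.
without_empty = pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'],
                            dropna=True)

print(with_empty)
print(without_empty)
```

The ('one', 'shiny') column should appear only in `with_empty`, since no row of the input has that combination.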

@jtratner
Contributor

jtratner commented Jul 2, 2013

Question: Why did you choose drop_na instead of dropna which is the name of the data frame method?

@hayd
Contributor Author

hayd commented Jul 2, 2013

@jtratner whoops!

@hayd
Contributor Author

hayd commented Jul 3, 2013

I've also realised that doing something like np.testing.assert_equal(m[:3], m) doesn't raise (I guess I need to use np.testing.assert_equal(m[:3].values, m.values))...

@jtratner
Contributor

jtratner commented Jul 3, 2013

@hayd assert_frame_equal? or assert_almost_equal?
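
A rough sketch of the `assert_frame_equal` route, assuming a recent pandas where the helper is exposed as `pd.testing.assert_frame_equal` (at the time of this thread it lived in `pandas.util.testing`). The frame `m` and the wrapper `frames_match` are made up for illustration, not taken from the PR's test suite.

```python
import pandas as pd

# A small frame with a MultiIndex, standing in for the `m` discussed above.
index = pd.MultiIndex.from_product([list("AB"), [1, 2]],
                                   names=["outer", "inner"])
m = pd.DataFrame({"x": range(4)}, index=index)


def frames_match(left, right):
    """Return True if assert_frame_equal passes, False if it raises."""
    try:
        pd.testing.assert_frame_equal(left, right)
    except AssertionError:
        return False
    return True


same = frames_match(m, m.copy())         # identical frames
truncated = frames_match(m.iloc[:3], m)  # first three rows vs. the full frame
print(same, truncated)
```

The comparison of the truncated frame against the full one fails on the shape mismatch, which is the check the test needed.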

@jtratner
Contributor

jtratner commented Jul 3, 2013

ah didn't realize it was a multiindex. (internal representation is an empty ndarray)

@cpcloud
Member

cpcloud commented Jul 3, 2013

imo len(some_multiindex) should behave as if some_multiindex is a 2d array.

@cpcloud
Member

cpcloud commented Jul 3, 2013

any reason why that shouldn't be the case?

@jtratner
Contributor

jtratner commented Jul 3, 2013

__len__ needs to match __iter__ (generally) - I need to look at it

@jtratner
Contributor

jtratner commented Jul 3, 2013

I think it needs to be:

def __len__(self):
    return len(self.levels) * len(self.labels)

@jtratner
Contributor

jtratner commented Jul 3, 2013

That way len(list(iter(multi_index))) == len(multi_index)

@cpcloud
Member

cpcloud commented Jul 3, 2013

that doesn't match __iter__ since __iter__ is over each tuple. should be

def __len__(self):
    return len(self.values)

@cpcloud
Member

cpcloud commented Jul 3, 2013

e.g., if there are 10 levels and 10 labels then len will return 100, which makes sense as maybe a size attr like an ndarray's, but really MultiIndex straddles the array-of-tuples interpretation and the 2d-array interpretation
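
A quick way to check the `__len__` / `__iter__` point, sketched against a recent pandas. Note that the attribute called `labels` in this thread was later renamed to `codes`, so the sketch uses `codes`; the example index itself is made up.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([list("ABC"), [1, 2]])

# Iterating yields one tuple per row, so len() should agree with that count.
n_iterated = len(list(iter(mi)))
n_len = len(mi)
n_values = len(mi.values)

# levels and codes each hold one entry per level (two here),
# so their product is unrelated to the number of rows.
n_levels_times_codes = len(mi.levels) * len(mi.codes)

print(n_iterated, n_len, n_values, n_levels_times_codes)
```

Here the index has six rows while `len(mi.levels) * len(mi.codes)` is four, so that product cannot serve as `__len__`.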

@jtratner
Contributor

jtratner commented Jul 3, 2013

Oh, I thought len(self.labels) * len(self.levels) was equal to len(self.values)?

@jtratner
Contributor

jtratner commented Jul 3, 2013

Nope, I'm wrong :P

@hayd
Contributor Author

hayd commented Jul 3, 2013

@cpcloud no reason really... I was away from the internet when I wrote the test case and forgot what it was.

@hayd
Contributor Author

hayd commented Jul 6, 2013

Converted test case to example from previous issue.

cartesian_product is slightly slower (when passing to MultiIndex) for small inputs but considerably faster for large ones:

In [23]: %timeit pd.MultiIndex.from_tuples(list(product(list('ABC'), [1, 2])))
1000 loops, best of 3: 399 us per loop

In [24]: %timeit pd.MultiIndex.from_arrays(cartesian_product([list('ABC'), [1, 2]]))
1000 loops, best of 3: 541 us per loop
X = list('ABC' * 100)
Y = [1,2] * 100

In [27]: %timeit pd.MultiIndex.from_arrays(cartesian_product([X, Y]))
100 loops, best of 3: 8.1 ms per loop

In [28]: %timeit pd.MultiIndex.from_tuples(list(product(X, Y)))
10 loops, best of 3: 21.5 ms per loop
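
To make the comparison above concrete, here is a sketch of the array-based idea. `cartesian_product_sketch` is a hypothetical stand-in written for this illustration with `np.repeat` and `np.tile`; it is not the pandas helper timed above, and it assumes every input is non-empty.

```python
from itertools import product

import numpy as np
import pandas as pd


def cartesian_product_sketch(arrays):
    """Return one flat array per input, in the same order as itertools.product.

    Hypothetical illustration only; assumes all inputs are non-empty.
    """
    arrays = [np.asarray(x) for x in arrays]
    lengths = [len(x) for x in arrays]
    total = int(np.prod(lengths))
    out = []
    repeat_each = total
    for arr, n in zip(arrays, lengths):
        # Earlier inputs vary slowest: repeat each element, then tile the run.
        repeat_each //= n
        tile_times = total // (n * repeat_each)
        out.append(np.tile(np.repeat(arr, repeat_each), tile_times))
    return out


X = list('ABC')
Y = [1, 2]

from_arrays = pd.MultiIndex.from_arrays(cartesian_product_sketch([X, Y]))
from_tuples = pd.MultiIndex.from_tuples(list(product(X, Y)))

# Both routes should enumerate the same tuples in the same order.
print(list(from_arrays) == list(from_tuples))
```

The array route builds whole columns at once instead of one Python tuple per row, which is one plausible reason it pulls ahead on large inputs.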

@jreback
Contributor

jreback commented Jul 10, 2013

merge?

hayd added a commit that referenced this pull request Jul 10, 2013
ENH: add dropna argument to pivot_table
@hayd hayd merged commit 89b2f83 into pandas-dev:master Jul 10, 2013
Successfully merging this pull request may close these issues.

Option for crosstab/pivot_table to include empty columns
4 participants